Abstract

On-line learning of a rule given by an N-dimensional Ising perceptron is considered for the case when the student is constrained to take values in a discrete state space of size $L^N$. For $L = 2$ no on-line algorithm can achieve a finite overlap with the teacher in the thermodynamic limit. However, if $L$ is on the order of $\sqrt{N}$, Hebbian learning does achieve a finite overlap.

Artificial neural networks are usually trained by a set of examples [1]. After the training
phase such a network (the "student") has achieved some knowledge about the rule (the "teacher")
which has generated the examples. The difference between the outputs of the student and
the teacher for a random input vector defines the generalization error. 

There are two basic kinds of training algorithms: 1. In batch mode the complete set 
of examples is stored and iteratively used to change the synaptic weights of the student 
network. 2. In on-line mode each example is used only once. At each training step a new 
example is presented and the synaptic weights are changed according to some algorithm. 

The analysis of on-line algorithms using methods of statistical mechanics |], §, ||, 0- § 
has shown that this is a powerful and versatile approach to learning problems. To our 
knowledge, however, only continuous couplings have so far been considered. But for hardware
implementations it would be extremely useful to design algorithms which work in a
discrete space of synaptic weights. It is not known whether on-line algorithms work at all
for weights which have a limited number $L$ of possible values. Here we show for a simple
case that generalization is only possible if $L$ is of the order of $\sqrt{N}$, where $N$ is the size of
the network.

We consider perhaps the simplest learning scenario, in which the teacher is a perceptron
with $N$ binary couplings $B_i \in \{-1, 1\}$. In on-line learning, the student perceptron with
weight vector $J$ receives at each time step an $N$-dimensional input $\xi$ and the classification bit
$\sigma_B(\xi) \in \{-1, 1\}$ provided by the teacher $B$. The task is to find a mapping $J' = f(J, \xi, \sigma_B(\xi))$
which updates the student $J$, our current approximation of $B$, based on this information. Of
course, $J'$ should be an improved approximation. Under very general conditions, we show
in this Letter that no such mapping exists if $J$ and $J'$ are confined to lie, as the teacher is,
in the set $\{-1, 1\}^N$. In a second step, we consider Hebbian learning in a discretized state
space of size $L^N$, and determine the generalization behavior as a function of $\lambda = L/\sqrt{N}$.

The classification of $\xi$ is given by $\sigma_B(\xi) = \mathrm{sign}(B^T \xi)$. Hence the quality of the
approximation provided by a student $J$ can be defined via the overlap $R = N^{-1} B^T J$ with the
teacher. Since the students have binary components, it is convenient to have the update
rule $f$ specify at which sites the sign should be flipped to obtain the updated weight vector
$J'$. So $J_i' = J_i f_i(J, \xi, \sigma_B(\xi))$ and the $f_i$ take values in $\{-1, 1\}$. The update rule will be useful
if it improves on our current state, that is if

$$ B^T J' = \sum_{i=1}^{N} B_i J_i f_i(J, \xi, \sigma_B(\xi)) > B^T J . \qquad (1) $$
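
The quantities defined so far can be made concrete with a minimal Python sketch. The function names (`sigma`, `overlap`, `apply_flips`, `is_useful`) are illustrative and do not come from the Letter; the list `f` plays the role of the flip variables $f_i$. Note that `is_useful` evaluates condition (1) using the teacher $B$, which an actual update rule is of course not allowed to see.

```python
def sigma(B, xi):
    """Teacher's classification bit sign(B^T xi); assumes a nonzero field."""
    field = sum(b * x for b, x in zip(B, xi))
    return 1 if field > 0 else -1

def overlap(B, J):
    """Overlap R = B^T J / N between teacher B and student J."""
    return sum(b * j for b, j in zip(B, J)) / len(B)

def apply_flips(J, f):
    """Updated student J'_i = J_i * f_i, with every f_i in {-1, +1}."""
    return [j * fi for j, fi in zip(J, f)]

def is_useful(B, J, f):
    """Condition (1): the update strictly increases B^T J."""
    return overlap(B, apply_flips(J, f)) > overlap(B, J)

B = [1, -1, 1, 1]
J = [1, 1, 1, -1]
print(sigma(B, [1, 1, 1, 1]))          # field is 2, so the bit is 1
print(overlap(B, J))                   # 0.0
print(is_useful(B, J, [1, -1, 1, 1]))  # True: flipping the second site aligns it with B
```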

Of course $f$ cannot have any built-in knowledge about the teacher but must infer information
about $B$ from the current pattern. Formally this can be enforced by requiring that $f$ be
useful not just for the single teacher $B$ but on average, for teachers which have the same
overlap as $B$ with $J$. Denoting by $\langle \ldots \rangle_{B | B^T J = NR}$ the average over the uniform distribution
on the set of teachers which have overlap $R$ with $J$, a useful $f$ must thus fulfill:

$$ \left\langle \sum_{i=1}^{N} B_i J_i f_i(J, \xi, \sigma_B(\xi)) \right\rangle_{B | B^T J = NR} > NR . \qquad (2) $$

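By a gauge transformation the LHS may be written as $\left\langle \sum_{i=1}^{N} B_i f_i(J, \xi, \sigma_B(\xi^*)) \right\rangle_{B | \sum_j B_j = NR}$,
where $\xi^*$ is given by $\xi^*_i = J_i \xi_i$. Using that for the Heaviside step function $\theta$, $1 = \theta(\sigma_B(\xi^*)) + \theta(-\sigma_B(\xi^*))$, we may rewrite (2) as

$$ \sum_{\sigma \in \{-1,1\}} \sum_{i=1}^{N} f_i(J, \xi, \sigma) \left\langle B_i \, \theta(\sigma B^T \xi^*) \right\rangle_{B | \sum_j B_j = NR} > NR . \qquad (3) $$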

Under mild conditions on $\xi$, one finds that

$$ \left\langle B_i \, \theta(\sigma B^T \xi^*) \right\rangle_{B | \sum_j B_j = NR} > 0 \qquad (4) $$

for any positive $R$ in the limit of large $N$. Consequently the LHS of (3) is maximized by
choosing $f_i(J, \xi, \sigma) = 1$, and the best we can do is to keep the weight vector $J$ fixed.
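
The conclusion can be illustrated numerically for one particular, hypothetical flip rule; this is a sanity check under stated assumptions, not a substitute for the argument above. The rule below picks one random site where the student disagrees with the Hebbian term $\xi_i \sigma_B(\xi)$ and flips it. The parameter values (`N=201`, 50 wrong sites, so $R \approx 0.5$) and the function name are choices made here for illustration.

```python
import random

def mean_change_single_flip(N=201, n_wrong=50, trials=1000, seed=0):
    """Average change of B^T J when one 'Hebb-disagreeing' site is flipped.

    Hypothetical rule: among sites with J_i != xi_i * sigma_B(xi), flip one
    chosen at random. The same student J is reused for every trial.
    """
    rng = random.Random(seed)
    B = [rng.choice((-1, 1)) for _ in range(N)]
    J = list(B)
    for i in rng.sample(range(N), n_wrong):   # student with overlap (N - 2*n_wrong)/N
        J[i] = -J[i]
    total = 0
    for _ in range(trials):
        xi = [rng.choice((-1, 1)) for _ in range(N)]
        field = sum(b * x for b, x in zip(B, xi))
        s = 1 if field > 0 else -1            # N is odd, so the field is never zero
        candidates = [i for i in range(N) if J[i] != xi[i] * s]
        if candidates:
            i = rng.choice(candidates)
            total += -2 * B[i] * J[i]         # flipping J_i changes B^T J by -2 B_i J_i
    return total / trials

print(mean_change_single_flip())  # negative on average: the flip rule loses overlap
```

Most of the sites that disagree with the Hebbian term are sites where the student is already correct, so the flip tends to destroy information rather than add it.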

There are some special cases where (4) is not true. If just a single component of $\xi$ is
nonzero, then $\sigma_B(\xi)$ will of course give us the corresponding component of $B$ and one can
achieve $R = 1$ by asking $N$ such questions. But it is hard to see how such a strategy might
be extended to the case of a noisy teacher.

For more generic patterns, however, the $\xi_i$ will be of similar magnitude. Further, $\xi$ will
only have a small overlap with $J$, that is $m = \sum_i \xi_i J_i / |\xi|$ will be of order 1. Then for large
$N$, and consequently small $\xi_i/|\xi|$, the LHS of (4) may be evaluated using the central limit
theorem and yields an expression (5) which is positive. So if the components of $\xi$ are picked
independently from distributions having bounded ratios of their variances, the fraction of
inputs for which (4) is violated decreases exponentially with $N$.
An even stronger statement can be made for binary inputs, $\xi_i \in \{-1, 1\}$. Then the large
$N$ expansion yielding (5) can only be wrong if the input is correlated with the student
($|m| \gg 1$). But for this case (4) may be verified by evaluating its LHS with the saddle-point
method. Consequently, for binary inputs, on-line learning is impossible even if queries [7]
are allowed.

As it is possible to learn on-line with continuous couplings, the question arises what
the numerical depth of the couplings must be for on-line learning to succeed. We thus
consider a situation where the $J_i$ are constrained to lie in the set $\{1, 2, \ldots, L\}$, still with
a binary teacher. A weight vector $J$ is then taken to represent an estimate $\hat{B}$ of $B$ via
$\hat{B}_i = \mathrm{sign}(J_i - L/2)$. For randomly chosen binary inputs, Hebbian learning may be applied
to $J$ by truncating to the allowed range of values:

$$ J_i' = \begin{cases} J_i + \xi_i \sigma_B(\xi) & \text{if } J_i + \xi_i \sigma_B(\xi) \in \{1, \ldots, L\} \\ J_i & \text{else.} \end{cases} \qquad (6) $$
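
A minimal sketch of the truncated Hebbian step (6) and of the read-out of the estimate follows. The function names are illustrative. One convention is an assumption made here: for even $L$ the expression $\mathrm{sign}(J_i - L/2)$ vanishes at $J_i = L/2$, and the sketch maps that case to $-1$.

```python
def hebb_clipped(J, xi, sigma_b, L):
    """One step of rule (6): add xi_i * sigma_b unless that leaves {1, ..., L}."""
    new_J = []
    for j, x in zip(J, xi):
        candidate = j + x * sigma_b
        new_J.append(candidate if 1 <= candidate <= L else j)
    return new_J

def estimate(J, L):
    """Estimate of the teacher, sign(J_i - L/2); ties are mapped to -1 (assumption)."""
    return [1 if j > L / 2 else -1 for j in J]

J = hebb_clipped([1, 3, 5], xi=[-1, 1, 1], sigma_b=1, L=5)
print(J)               # [1, 4, 5]: the first and last sites are blocked at the boundary
print(estimate(J, 5))  # [-1, 1, 1]
```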

The increments $\xi_i \sigma_B(\xi)$ are not independent over the sites $i$ but their covariances do decay
as $1/N$. So for large $N$ the sites will approximately decouple, and we are left with a biased
random walk on each site. The bias is given by

$$ \langle B_i \xi_i \sigma_B(\xi) \rangle = \sqrt{\frac{2}{\pi N}} , \qquad (7) $$

where $\langle \ldots \rangle$ is an average over random vectors $\xi$.
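
The bias can be checked without sampling. For binary inputs, averaging $B_i \xi_i \sigma_B(\xi)$ over the sites gives $|B^T \xi|/N$, and $B^T \xi$ is a sum of $N$ independent $\pm 1$ variables, so the average follows from the binomial distribution. The sketch below compares this exact value with the large-$N$ form $\sqrt{2/(\pi N)}$; the function names are illustrative.

```python
import math

def exact_bias(N):
    """E|S_N| / N for S_N a sum of N independent +-1 variables (binary inputs)."""
    total = sum(math.comb(N, k) * abs(N - 2 * k) for k in range(N + 1))
    return total / (N * 2 ** N)

def asymptotic_bias(N):
    """Large-N form of the bias, sqrt(2 / (pi N))."""
    return math.sqrt(2.0 / (math.pi * N))

for N in (11, 101, 1001):
    print(N, exact_bias(N), asymptotic_bias(N))
```

Since the up and down probabilities of the walk satisfy $g - r = $ bias, this is the same statement as $g = 1/2 + 1/\sqrt{2\pi N}$.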

Let $p_l(t)$ denote the probability that $J_i = l$ after $t$ iterations of (6) and assume that
$B_i = 1$. Then

$$ \begin{aligned} p_1(t+1) &= r\,p_1(t) + r\,p_2(t) \\ p_l(t+1) &= g\,p_{l-1}(t) + r\,p_{l+1}(t), \qquad l = 2, \ldots, L-1 \\ p_L(t+1) &= g\,p_{L-1}(t) + g\,p_L(t) , \end{aligned} \qquad (8) $$



where $r + g = 1$ and $g = 1/2 + 1/\sqrt{2\pi N}$ for large $N$. The stationary solution $p^s$ of (8) is
$p^s_l \propto (g/r)^l$. Thus for large $N$ the asymptotic overlap $R^s$ between the estimate $\hat{B}$ and the
teacher will approach zero if $L$ is fixed. For $L = \lambda\sqrt{N}$, however, one finds

$$ R^s = 1 - \frac{2}{1 + e^{\sqrt{2/\pi}\,\lambda}} . \qquad (9) $$
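
The stationary solution and the resulting overlap can be verified numerically. The sketch below implements one step of the random walk (8), confirms that $p_l \propto (g/r)^l$ is a fixed point, and compares the overlap of the discrete chain with the closed form $1 - 2/(1 + e^{\sqrt{2/\pi}\lambda})$. Function names and the example values ($N = 10^6$, $\lambda = 1$) are choices made here; the read-out uses $\mathrm{sign}(l - L/2)$ with the value 0 at $l = L/2$.

```python
import math

def step(p, g):
    """One iteration of (8); p[l-1] is the probability of state l, needs len(p) >= 2."""
    r = 1.0 - g
    L = len(p)
    q = [0.0] * L
    q[0] = r * p[0] + r * p[1]
    for l in range(1, L - 1):
        q[l] = g * p[l - 1] + r * p[l + 1]
    q[L - 1] = g * p[L - 2] + g * p[L - 1]
    return q

def stationary(L, g):
    """Normalized stationary distribution p_l proportional to (g/r)^l, l = 1..L."""
    r = 1.0 - g
    w = [(g / r) ** l for l in range(1, L + 1)]
    z = sum(w)
    return [x / z for x in w]

def stationary_overlap(L, g):
    """Overlap of the estimate with a teacher component B_i = 1."""
    p = stationary(L, g)
    sign = lambda x: (x > 0) - (x < 0)
    return sum(pl * sign(l - L / 2) for l, pl in enumerate(p, start=1))

def overlap_formula(lam):
    """Closed form (9) for L = lam * sqrt(N) in the limit of large N."""
    return 1.0 - 2.0 / (1.0 + math.exp(math.sqrt(2.0 / math.pi) * lam))

N, lam = 10 ** 6, 1.0
L = int(lam * math.sqrt(N))
g = 0.5 + 1.0 / math.sqrt(2.0 * math.pi * N)
print(stationary_overlap(L, g), overlap_formula(lam))
```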

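The time needed to approach the stationary distribution will scale linearly with $N$ for
fixed $\lambda$. So let $R(\alpha)$ be the overlap after $\alpha N$ steps, assuming that initially $J_i = L/2$. The
time evolution of $R$ may then be calculated using the explicit formulas for the powers of the
transition matrix of the random walk (8), which are known from the literature. One finds:

$$ R(\alpha) = R^s - 4\sqrt{2\pi}\,\lambda \sum_{k=0}^{\infty} \frac{\exp\left\{ -\alpha \left[ \frac{1}{\pi} + \frac{\pi^2 (2k+1)^2}{2\lambda^2} \right] \right\}}{2\lambda^2 + \pi^3 (2k+1)^2} . \qquad (10) $$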

The resulting dependence of the overlap on $\lambda$ (for fixed $\alpha$) is nonmonotonic as shown in
Figure 1. For large $\alpha$ the sum in (10) is dominated by the first term and $R$
relaxes exponentially towards $R^s$; this gives the relaxation time

$$ \tau = \frac{2\lambda^2 \pi}{2\lambda^2 + \pi^3} . \qquad (11) $$
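
The time evolution can be evaluated numerically by truncating the sum over $k$. The sketch below assumes that (10) has the form $R(\alpha) = R^s - 4\sqrt{2\pi}\lambda \sum_k \exp\{-\alpha[1/\pi + \pi^2(2k+1)^2/(2\lambda^2)]\} / [2\lambda^2 + \pi^3(2k+1)^2]$, with $R^s$ from (9); this form is an assumption of the sketch, chosen because it starts at $R(0) = 0$ and its slowest decay rate is the inverse of the relaxation time (11). The function names and the truncation at `terms=10000` are illustrative.

```python
import math

def r_stationary(lam):
    """Stationary overlap (9)."""
    return 1.0 - 2.0 / (1.0 + math.exp(math.sqrt(2.0 / math.pi) * lam))

def overlap_series(alpha, lam, terms=10000):
    """Truncated evaluation of the series assumed for (10)."""
    total = 0.0
    for k in range(terms):
        n2 = (2 * k + 1) ** 2
        rate = 1.0 / math.pi + math.pi ** 2 * n2 / (2.0 * lam ** 2)
        total += math.exp(-alpha * rate) / (2.0 * lam ** 2 + math.pi ** 3 * n2)
    return r_stationary(lam) - 4.0 * math.sqrt(2.0 * math.pi) * lam * total

def relaxation_time(lam):
    """Relaxation time (11), the inverse decay rate of the k = 0 term."""
    return 2.0 * lam ** 2 * math.pi / (2.0 * lam ** 2 + math.pi ** 3)

for lam in (0.5, 1.0, 2.0, 4.0):
    print(lam, overlap_series(1.0, lam), relaxation_time(lam))
```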
To find the behavior for $L \gg \sqrt{N}$, we need to take the limit $\lambda \to \infty$ in (10), that is, replace
the sum over $k$ by an integral. This yields

$$ R(\alpha) = 1 - 2H\left(\sqrt{2\alpha/\pi}\right) , \qquad (12) $$

the result previously found for the case where one applies Hebb's rule to continuous couplings
and clips in the end.
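
The limiting curve (12) is straightforward to evaluate. The sketch assumes that $H$ denotes the Gaussian tail integral, $H(x) = \int_x^\infty e^{-z^2/2}\,dz/\sqrt{2\pi}$, which is the usual convention in this literature and can be written with the complementary error function. The function names are illustrative.

```python
import math

def H(x):
    """Gaussian tail integral, assumed definition: H(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def overlap_continuous(alpha):
    """Large-lambda limit (12): R(alpha) = 1 - 2 H(sqrt(2 alpha / pi))."""
    return 1.0 - 2.0 * H(math.sqrt(2.0 * alpha / math.pi))

# Values of alpha used for the curves in Figure 1
for alpha in (0.1, 1.0, 10.0):
    print(alpha, overlap_continuous(alpha))
```

With this definition (12) is equivalent to $R(\alpha) = \mathrm{erf}(\sqrt{\alpha/\pi})$, which rises monotonically from 0 to 1.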

We have considered only simple Hebbian learning here. However, since $\alpha N$ examples
will be needed to achieve good generalization, we believe that one cannot improve on the
scaling, $L = \lambda\sqrt{N}$, by using a different algorithm.

One of the authors (W.K.) would like to thank Ido Kanter for useful discussions. The 
work of R.U. was supported by the Deutsche Forschungsgemeinschaft (DFG). 

Figure 1: Overlap $R$ achieved by Hebbian learning using $L$ different
weight values per coupling, $L = \lambda\sqrt{N}$. The curves are,
from top to bottom, for $\alpha = 10$, $\alpha = 1$ and $\alpha = 0.1$.